Stats Review

These notes are paraphrased and shortened content from Casella and Berger (CB), Chapter 5.

As needed, these notes will also be updated throughout the semester. I want to only give you what you need right now!

For most interesting questions, we want to make statements about the whole population of interest.

Unfortunately, we often can't look at data from the whole population. Instead, we sample from it, and look at the sample. Done carelessly, this is a horrible idea and often goes terribly wrong.

Exercise: Why? What are some examples of how this could go wrong?

However, if we are very careful about how we sample, and we sample enough data, we can make statements about the whole population given data on only a sample of it. This is where the field of statistics comes in.

Sample and IID

Random variables $X_1,...,X_n$ are called a random sample of size $n$ from the population $f(x)$ if $X_1,...,X_n$ are mutually independent random variables and the marginal pdf/pmf of each $X_i$ is the same function $f(x)$.

Alternatively, we can (and will) in this case call $X_1,...,X_n$ independent and identically distributed random variables with pdf/pmf $f(x)$, or, for short, a set of IID random variables. (CB Def'n 5.1.1)
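
To make this concrete, here is a minimal sketch in NumPy (the Exponential population and the seed are arbitrary choices for illustration): each entry of `sample` is an independent draw, and every draw has the same pdf.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Draw an IID sample of size n = 100 from an (arbitrarily chosen)
# Exponential population: the draws are mutually independent, and
# each one has the same pdf f(x).
n = 100
sample = rng.exponential(scale=2.0, size=n)
```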

Definition of a statistic

Let $X_1, ..., X_n$ be a set of IID random variables, and let $T(x_1, ...,x_n)$ be a real- or vector-valued function whose domain includes the sample space of $(X_1,...,X_n)$. Then we call $Y = T(X_1, ...,X_n)$ a statistic.
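
For instance (a small sketch, reusing the NumPy sample from above), the mean, the maximum, and the pair (min, max) are all statistics, since each is a function of the sample and nothing else:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.exponential(scale=2.0, size=100)

# Each of these is a statistic: a function of the sample alone
# (not of any unknown population parameter).
y_mean = np.mean(sample)                   # real-valued
y_max = np.max(sample)                     # real-valued
y_pair = (np.min(sample), np.max(sample))  # vector-valued
```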

Exercise: What are some statistics we might be interested in computing for a set of Twitter users?

Sampling Distribution

The probability distribution of a statistic $Y$ is called the sampling distribution of $Y$. (CB 5.2.1)
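
We can approximate a sampling distribution by simulation (a sketch; the population and the choice of statistic are arbitrary here): draw many independent samples, compute the statistic on each, and look at the collection of values we get.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

n, n_reps = 100, 10_000

# Draw n_reps independent samples of size n, and compute the statistic
# (here, the sample maximum) on each one. The 10,000 resulting values
# approximate the sampling distribution of the maximum.
samples = rng.exponential(scale=2.0, size=(n_reps, n))
maxima = samples.max(axis=1)

print(maxima.mean(), maxima.std())
```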

Sample Mean and Variance

The sample mean is the average of the values in a random sample. It is usually denoted by:

$$ \bar{X} = \frac{X_1+...+X_n}{n} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

The sample variance is the statistic defined by:

$$ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
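
Both are one-liners in NumPy, with one gotcha worth flagging (a sketch on simulated data): `np.var` divides by $n$ by default, so you need `ddof=1` to match the $n-1$ denominator in the definition of $S^2$ above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
sample = rng.exponential(scale=2.0, size=100)

x_bar = sample.mean()

# np.var divides by n by default; ddof=1 gives the n-1 denominator
# from the definition of S^2 above.
s_squared = sample.var(ddof=1)

# Equivalent to writing the formula out directly:
s_squared_manual = ((sample - x_bar) ** 2).sum() / (len(sample) - 1)
```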

Law of Large Numbers

Informally - as your sample size $n$ increases, the mean of your sample gets very close to the mean of your population.

Formally, the (strong) law of large numbers states (CB 5.5.9): let $X_1, X_2,...$ be IID random variables with $\mathbb{E}(X_i) = \mu$ and $\mathrm{Var}(X_i) = \sigma^2 < \infty$, and define $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$. Then, for every $\epsilon > 0$,

$$P(\lim_{n \to \infty} | \bar{X}_n - \mu| < \epsilon) = 1 $$

We will not talk about convergence in this class. If you are interested, see the relevant CB chapter, or other math/stats courses at UB!
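
You can watch the law of large numbers at work with a quick simulation (a sketch; the Exponential(scale=2) population has mean $\mu = 2$): the running sample mean settles down near $\mu$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Exponential(scale=2) population, so mu = 2.
draws = rng.exponential(scale=2.0, size=100_000)

# Running sample means: X_bar_n for n = 1, ..., 100000.
running_means = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(n, running_means[n - 1])  # drifts toward mu = 2
```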

Central Limit Theorem

Informally: the distribution of the (suitably standardized) sample mean approaches a normal distribution as $n$ grows.

Slightly less informally: as $n \to \infty$, $\sqrt{n} \frac{(\bar{X}_n-\mu)}{\sigma}$ converges in distribution to $\mathcal{N}(0,1)$.

Most formally, see CB Thm. 5.5.14!

This is amazing! Because with minimal assumptions, we can (probabilistically) bound the distance between our sample mean (our estimate of the population mean) and the true population mean!
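
To see the CLT in action (a sketch; again an Exponential(scale=2) population, which has $\mu = 2$ and $\sigma = 2$), we can standardize many simulated sample means and check that they behave like draws from $\mathcal{N}(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

mu, sigma = 2.0, 2.0   # Exponential(scale=2): mean 2, standard deviation 2
n, n_reps = 100, 10_000

# 10,000 sample means, each from an IID sample of size n.
means = rng.exponential(scale=2.0, size=(n_reps, n)).mean(axis=1)

# Standardize: sqrt(n) * (X_bar_n - mu) / sigma should look like N(0, 1).
z = np.sqrt(n) * (means - mu) / sigma

print(z.mean(), z.std())          # roughly 0 and 1
print((np.abs(z) < 1.96).mean())  # roughly 0.95, as for a standard normal
```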